Abstract

Structure learning of Bayesian networks is an important problem
that arises in numerous machine learning applications. In this work,
we present a novel approach for learning the structure of Bayesian
networks using the solution of an appropriately constructed traveling
salesman problem. In our approach, one computes an optimal ordering
(partially ordered set) of random variables using methods for the
traveling salesman problem. This ordering significantly reduces the
search space for the subsequent greedy optimization that computes the
final structure of the Bayesian network. We demonstrate our approach
on real-world census and weather datasets. In both cases, the approach
accurately captures dependencies between random variables. We check
the accuracy of the predictions against independent studies in both
application domains.



1 Introduction 

Bayesian networks belong to the class of probabilistic graphical models and
can be represented as directed acyclic graphs (DAGs) [1]. They have been
used extensively in a wide variety of applications, for instance for analysis of
gene expression data [2], medical diagnostics [3], machine vision [4], behavior
of robots [5], and information retrieval [6], to name a few.

Bayesian networks capture the joint probability distribution of the set 
X of random variables (nodes in the DAG). The edges of the DAG capture 
the dependence structure between variables. In particular, nodes that are 
not connected to one another in the DAG are conditionally independent [7] . 



Learning the structure of Bayesian networks is a challenging problem and
has received significant attention [8, 9, 10, 11]. It is well known that, given
a dataset, the problem of optimally learning the associated Bayesian
network structure is NP-hard [11]. Several methods to learn the structure of
Bayesian networks have been proposed over the years. Arguably, the most
popular and successful approaches have been built around greedy
optimization schemes [11, 12]. Exact approaches for learning the structure of Bayesian
networks have a scaling of $O(n 2^n + n^{k+1} C(m))$, where $n$ is the number of
random variables, $k$ is the maximum in-degree and $C(m)$ is a linear function
of the data size $m$ [13]. These approaches are based on solving a dynamic
program [14]. For large Bayesian networks the above scaling for exact
algorithms is prohibitive [14].

In this work, we present a heuristic approach for learning the structure
of Bayesian networks from data. The approach is based on computing an
ordering of the random variables using the traveling salesman problem (TSP).
Though using the ordering to learn Bayesian networks is not new [15], using
the TSP for this task is novel. This approach provides us with the
opportunity to leverage efficient implementations of TSP algorithms such as
the Lin-Kernighan heuristic¹ [17] and cutting plane methods² [19] for fast
structure learning of Bayesian networks.

The remainder of the paper is organized as follows. In Section 2, we
describe the approach for learning Bayesian networks using a history
dependent TSP formulation. In Section 3, we develop techniques for solving the
history dependent TSP. We then present results on the Adult and El Nino
datasets in Section 4. We finally draw conclusions and discuss future work
in Section 5.

2 Structure Learning of Bayesian Networks Using 
the Traveling Salesman Problem 

Although we use the K2 metric [10] to construct the Bayesian network, the
only assumption our approach makes is that the scoring metric is
decomposable [14]:

$$\mathrm{GRAPHSCORE} = \sum_{x \in V} \mathrm{NODESCORE}(x \mid \mathrm{parents}(x)). \tag{1}$$

¹ LKH software [16] is a popular implementation of this approach.

² Concorde TSP solver [18] is an efficient implementation of a cutting plane approach
coupled with other heuristics.
Thus, one can replace the K2 metric with any of the competing scoring
functions such as BIC [20], BDeu [21], BDe [22], and minimum description
length [23].
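
To make the decomposability in Eqn. (1) concrete, the following is a minimal sketch of a log K2 node score and the resulting graph score. It assumes the data are rows of integer-coded states; the names `log_k2_node_score`, `log_graph_score`, `arity`, and `parent_map` are illustrative and not taken from the paper.

```python
from collections import Counter
from math import lgamma


def log_k2_node_score(data, child, parents, arity):
    """Log K2 score of `child` given `parents` (Cooper-Herskovits form).

    data    : list of tuples of integer states (assumed encoding)
    arity   : arity[v] is the number of states of variable v
    """
    r = arity[child]
    joint = Counter()  # N_ijk: counts per (parent configuration, child state)
    marg = Counter()   # N_ij : counts per parent configuration
    for row in data:
        cfg = tuple(row[p] for p in parents)
        joint[(cfg, row[child])] += 1
        marg[cfg] += 1
    score = 0.0
    for n_ij in marg.values():
        # log[(r - 1)! / (N_ij + r - 1)!]
        score += lgamma(r) - lgamma(n_ij + r)
    for n_ijk in joint.values():
        score += lgamma(n_ijk + 1)  # log(N_ijk!)
    return score


def log_graph_score(data, parent_map, arity):
    """Decomposable score: the sum of per-node scores, as in Eqn. (1)."""
    return sum(log_k2_node_score(data, x, ps, arity)
               for x, ps in parent_map.items())
```

Because the score is a sum over nodes, changing the parents of one node only requires recomputing that node's term, which is what makes ordering-based and greedy searches cheap to evaluate.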

A link between the optimal ordering and the TSP can be established on
the basis of the decomposable metric. To find the best possible ordering $O$
we start from an empty set $\phi$. We define the cost of going from $\phi$ to single
random variables to be 0. Similarly, the cost of going from any permutation
of all random variables to $\phi$ is also defined to be 0. For any partial ordering
of random variables $O$ (one that does not include all random variables) we
know that

$$V(O) = V(O \setminus X) + \mathrm{Cost}(X, O \setminus X), \tag{2}$$

where $X$ is a random variable, $V$ is the value function, $O \setminus X$ is the set $O$
without $X$, and $\mathrm{Cost}(X, O \setminus X)$ is the cost of adding $X$ to $O \setminus X$.
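
The recursion in Eqn. (2) can be sketched as a dynamic program over subsets of variables. This sketch assumes that the recursion is minimized over the choice of the last variable $X$, as is usual for exact ordering search; the callback `cost(x, placed)` and the bitmask encoding are hypothetical choices made for illustration.

```python
def optimal_ordering(n, cost):
    """Exact ordering search over all subsets of n variables.

    cost(x, placed) is a hypothetical callback giving the cost of adding
    variable x after the variables encoded in the bitmask `placed`.
    Returns (minimum total cost, ordering as a list of variable indices).
    """
    full = (1 << n) - 1
    value = [0.0] * (full + 1)  # value[S] plays the role of V(O) for the set S
    last = [-1] * (full + 1)    # variable added last in the best ordering of S
    for subset in range(1, full + 1):
        best, arg = float("inf"), -1
        for x in range(n):
            if subset & (1 << x):
                prev = subset & ~(1 << x)           # O \ X
                cand = value[prev] + cost(x, prev)  # V(O \ X) + Cost(X, O \ X)
                if cand < best:
                    best, arg = cand, x
        value[subset], last[subset] = best, arg
    # Walk back from the full set to recover the ordering.
    order, subset = [], full
    while subset:
        order.append(last[subset])
        subset &= ~(1 << last[subset])
    return value[full], order[::-1]
```

The two nested loops visit every subset and every member of it, which is why this exact approach becomes impractical as the number of variables grows.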

The above dynamic program in Eqn. (2) will require $O(n^2 2^n)$
operations [13]. Instead of solving the above equation using dynamic
programming, we reformulate the problem as a history dependent TSP. This is easy
to see from Eqn. (2) by considering the random variables as cities of the tour
and the optimal ordering of random variables as a tour that minimizes the
overall cost (see Eqn. (3) and Fig. 1).



$$V(O) = \min \sum_{i=1}^{N} \left[ V(O(i+1)) - V(O(i)) \right]. \tag{3}$$

The history dependence arises due to the first term in the right hand side
of Eqn. (2). The advantage of treating this minimization as a TSP, however,
is the ability to leverage pre-existing TSP algorithms such as LKH [16], as
discussed in the next section. Note that our approach provides Bayesian
networks in which the directionality of arrows (causality) may be reversed.
This may be attributed to the fact that, given the data, these networks are
equally likely.
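
The history dependence can be illustrated with a short sketch. Unlike a standard TSP, where the cost of an edge depends only on its two endpoints, here the cost of adding a variable depends on the whole set of variables already visited. The function `step_cost` and the toy cost below are hypothetical and serve only to show the effect.

```python
def ordering_cost(order, step_cost):
    """Total cost of an ordering under a history dependent step cost.

    step_cost(x, placed) is a hypothetical callback: the cost of adding
    variable x when the variables in the frozenset `placed` precede it.
    """
    placed = frozenset()
    total = 0.0
    for x in order:
        total += step_cost(x, placed)  # depends on the full history, not one edge
        placed = placed | {x}
    return total


def toy_step_cost(x, placed):
    """Toy cost: adding "C" is free only if "A" has already been visited."""
    return 1.0 if (x == "C" and "A" not in placed) else 0.0
```

With this toy cost, the orderings `["A", "B", "C"]` and `["B", "C", "A"]` both traverse the edge from "B" to "C", yet the first costs 0 and the second costs 1, because the history preceding that edge differs.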



Figure 1: (a) Structure learning of Bayesian networks as a dynamic
program [14]. The permutation tree provides the order in which nodes should
be added to the list. (b) The equivalent solution of the history dependent
TSP for the computation of the optimal ordering.

3 Solving the History Dependent Traveling Salesman Problem

The traveling salesman problem (TSP) is a classic problem that has received
attention from the applied mathematics and computer science communities
for decades. In the traditional formulation, one is given a list of city
positions and tasked with finding a Hamiltonian cycle (a cycle that visits every
city exactly once and returns to the starting city) with lowest cost [25].
Enumerating all possible tours becomes infeasible beyond roughly 10 cities,
since the number of tours grows factorially with the number of cities.
In particular, the TSP is a well studied NP-hard problem [26].
Over several decades, many algorithms for computing the solution of the
TSP have been developed; for an overview we refer the reader to [19, 26].
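
As a small illustration of why enumeration does not scale, the following sketch solves a symmetric Euclidean TSP by brute force and counts the distinct tours. The function names are illustrative; the city coordinates in any usage are assumptions.

```python
from itertools import permutations
from math import dist, factorial


def brute_force_tour(cities):
    """Try every tour that starts at city 0 and return (length, tour)."""
    n = len(cities)
    best_len, best_tour = float("inf"), None
    for perm in permutations(range(1, n)):  # fix city 0 to remove rotations
        tour = (0,) + perm
        length = sum(dist(cities[tour[i]], cities[tour[(i + 1) % n]])
                     for i in range(n))
        if length < best_len:
            best_len, best_tour = length, tour
    return best_len, best_tour


def tour_count(n):
    """Number of distinct undirected tours through n >= 3 cities."""
    return factorial(n - 1) // 2
```

For 10 cities there are already 181440 distinct undirected tours, and each additional city multiplies the count by roughly the number of cities.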

To solve the history dependent TSP, we pick Helsgaun's popular version
of the Lin-Kernighan heuristic (LKH) [16], which naturally extends to our
case. LKH is a randomized approach that picks edges in the tour for removal
and adds ones that are "more likely" to be in the optimal tour. If the
replacement of edges reduces the cost, the change to the tour is accepted.
The likelihood of any edge being in the optimal tour is computed using the
α-nearness, which is based on minimum 1-trees in the underlying city graph [17].
LKH is regarded as one of the most successful approaches for computing the
optimal tour of TSPs with asymmetric cost [16].

In general, one replaces k edges in a simple iteration (known as k-opt
steps). Examples of the 2-opt and 3-opt steps are shown in Fig. 2. Note that
using higher values of k will, in general, give tours with lower cost. However,
as k increases, closing the tour becomes increasingly challenging [16].
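
A 2-opt step can be sketched as follows for a standard symmetric TSP. This is plain 2-opt with exhaustive scanning of edge pairs, not the LKH algorithm (it uses no α-nearness candidate sets and no randomization); the function names are illustrative.

```python
from math import dist


def tour_length(tour, cities):
    """Length of the closed tour visiting `cities` in the order `tour`."""
    n = len(tour)
    return sum(dist(cities[tour[i]], cities[tour[(i + 1) % n]])
               for i in range(n))


def two_opt(tour, cities):
    """Repeat improving 2-opt moves until no move shortens the tour."""
    tour = list(tour)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                # Removing two edges and reconnecting the tour is the same
                # as reversing the segment tour[i..j].
                candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                if tour_length(candidate, cities) < tour_length(tour, cities) - 1e-12:
                    tour, improved = candidate, True
    return tour
```

On four cities at the corners of a unit square, starting from the self-crossing tour `[0, 2, 1, 3]`, a single segment reversal removes the crossing and yields the perimeter tour.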

Figure 2: (a) 2-opt moves for the TSP. (b) 3-opt moves for the TSP.

The above approach extends naturally to the history dependent TSP.
In our problem, edges are again deleted and added randomly. Unlike the
standard TSP, the acceptance or rejection of an edge replacement now
depends on the direction of traversal as well as on the existing tour. For
structure learning of Bayesian networks, we compared our 2-opt and 3-opt
iterations with Helsgaun's implementation of LKH [16]. We find that, despite
ignoring history, the standard LKH software performs significantly better
than our 2-opt and 3-opt implementations with history. This is perhaps
because LKH uses sequential 5-opt steps as its basic move [16], which is
found to provide significantly better results. If Helsgaun's LKH software
were integrated with history dependent costs, we would expect it to
provide more accurate results; this is part of our ongoing efforts to improve
the approach. All results presented here were therefore computed
using the standard LKH software.
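
A history dependent variant of the random 2-opt iteration might look like the sketch below. It is not the authors' implementation: the move (a random segment reversal), the acceptance rule, and the callback `step_cost` are assumptions chosen to show that a candidate must be re-scored as a whole ordering, because reversing a segment changes both the direction of traversal and the history seen by later variables.

```python
import random


def ordering_cost(order, step_cost):
    """Total cost when step_cost(x, placed) depends on all earlier variables."""
    placed, total = frozenset(), 0.0
    for x in order:
        total += step_cost(x, placed)
        placed = placed | {x}
    return total


def random_two_opt(order, step_cost, iterations=1000, seed=0):
    """Random segment reversals, accepted only if the total cost decreases."""
    rng = random.Random(seed)
    best = list(order)
    best_cost = ordering_cost(best, step_cost)
    for _ in range(iterations):
        i, j = sorted(rng.sample(range(len(best)), 2))
        candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
        # Full re-evaluation: the move alters the history of every
        # variable from position i onwards.
        cand_cost = ordering_cost(candidate, step_cost)
        if cand_cost < best_cost:
            best, best_cost = candidate, cand_cost
    return best, best_cost
```

In a standard TSP a 2-opt move can be scored from the four affected edge costs alone; the full re-evaluation here is the price of the history dependence.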

4 Results 

We now test our approach of computing the structure of Bayesian networks
using the history dependent TSP on the Adult and El Nino datasets, publicly
available from the UCI Machine Learning Repository [27].

4.1 Adult Dataset



The Adult dataset was extracted from census data in 1994 by Ronny Kohavi
and Barry Becker [27]. The dataset consists of data for 48842 individuals
and includes attributes such as occupation, salary, number of hours
worked per week, race, native country, education, and marital status. For a
complete list of attributes see [27]. The dataset has missing
values, i.e., entries for certain individuals are not available. We discard
these data points to obtain a dataset with 30162 entries. We break
this dataset into training (29162 entries) and testing (1000 entries) parts.
Some of the attributes, such as salary and capital gain, are continuous; we
discretize these attributes (for the number of possible states see Table 1). We
then construct a Bayesian network using our TSP and greedy hill climbing
approach (shown in Fig. 3).
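
The preprocessing steps above can be sketched as follows. The paper does not state how missing values are marked, how the continuous attributes are binned, or how the test set is chosen; the `"?"` marker, the equal-width binning, and the split on the last rows are all assumptions made for illustration.

```python
def drop_missing(rows, missing="?"):
    """Keep only rows with no missing entries (marker is an assumption)."""
    return [row for row in rows if missing not in row]


def equal_width_bins(values, n_bins):
    """Map continuous values to integer states 0..n_bins-1 (assumed scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant column
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def split_train_test(rows, n_test):
    """Hold out the last n_test rows for testing (assumed split rule)."""
    return rows[:-n_test], rows[-n_test:]
```

For example, `equal_width_bins(column, 3)` would give a three-state variable, matching the number of states listed for Capital Gain, Capital Loss, and Hours/Week, although the actual bin boundaries used by the authors are not given.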



Random variable     Number of states
------------------------------------
Work Class           7
Education           16
Marital Status       7
Occupation          14
Race                 5
Capital Gain         3
Capital Loss         3
Hours/Week           3
Native Country      41
Salary               2



Table 1: Number of states for each random variable in the Adult dataset. 
Continuous variables have been discretized. 

The Bayesian network that is learnt using the TSP and hill climbing in
Fig. 3 automatically captures dependencies that are now known as a result
of several independent studies. For example, the Bayesian network captures
the dependency between the occupation and the number of hours worked per
week [28]. Similarly, the Bayesian network in Fig. 3 predicts dependencies
between education and salary [29], marital status and salary [30],
occupation and race [31], and marital status and number of hours worked per
week [32]. The dependencies between race and native country, occupation
and class of work, and salary and hours of work are obvious by definition
and simple arithmetic, respectively. Thus, we believe that the approach
accurately captures the dependencies between random variables from raw
data without any prior knowledge of their ordering. If one inputs an
incorrect ordering of random variables, the quality of predicted dependencies
degrades significantly. The comparison of resulting Bayesian networks can
be performed using the log likelihood ratio [33].
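
The role of the ordering in the greedy search can be sketched with the classical K2 procedure of Cooper and Herskovits, in which each variable may only take parents from the variables that precede it. This is a stand-in for illustration; the greedy hill climbing used in the paper may differ in its moves and stopping rule, and `max_parents` is a hypothetical parameter.

```python
from collections import Counter
from math import lgamma


def log_k2(data, child, parents, arity):
    """Log K2 score of `child` given `parents`; rows are integer tuples."""
    joint, marg = Counter(), Counter()
    for row in data:
        cfg = tuple(row[p] for p in parents)
        joint[(cfg, row[child])] += 1
        marg[cfg] += 1
    r = arity[child]
    score = sum(lgamma(r) - lgamma(n + r) for n in marg.values())
    return score + sum(lgamma(n + 1) for n in joint.values())


def greedy_parents(data, order, arity, max_parents=2):
    """For each variable, greedily add the best-scoring earlier variable."""
    parent_map = {}
    for pos, child in enumerate(order):
        parents = []
        best = log_k2(data, child, parents, arity)
        candidates = list(order[:pos])  # only predecessors in the ordering
        while candidates and len(parents) < max_parents:
            score, pick = max((log_k2(data, child, parents + [c], arity), c)
                              for c in candidates)
            if score <= best:
                break  # no candidate improves the score
            parents.append(pick)
            candidates.remove(pick)
            best = score
        parent_map[child] = parents
    return parent_map
```

Because candidates are restricted to predecessors, every result is acyclic by construction, and a different ordering of the same data can reverse the direction of a learnt edge.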

Figure 3: Structure of the Adult dataset Bayesian network learnt using the
history dependent TSP and greedy hill climbing.

To quantitatively test the prediction of the resulting Bayesian network we
check the prediction of P(Salary|Education, Marital Status). In particular,
we check the accuracy of the dependence structure on predicting whether
individuals in the testing dataset earn more than or less than $50,000 per
year (thus salary takes binary states: 0 or 1). We compute the mean square
error using the following expression,

$$\mathrm{MSE} = \frac{1}{N_t} \sum \left( E(Y) - Y \right)^2, \tag{4}$$

where N_t is the number of data points in the testing set, the sum runs over
those data points, Y is the output state (salary in this example), and E(Y) is
the expected value of Y predicted by the Bayesian network. The MSE for the
Adult dataset is 0.13. We also threshold the probabilities at 0.5, i.e., if
P(Salary > $50,000 | Education, Marital Status) is greater than 0.5, we
assume Salary > $50,000. In this case we
find that our approach correctly predicts the salary 78% of the time.
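
Both evaluation measures can be computed with a few lines. For a binary output, E(Y) is the predicted probability that Y = 1; the function names and the example numbers in the note below are illustrative and are not results from the paper.

```python
def mean_square_error(predicted, actual):
    """Eqn. (4): average squared gap between E(Y) and the observed Y."""
    return sum((p - y) ** 2 for p, y in zip(predicted, actual)) / len(actual)


def threshold_accuracy(predicted, actual, threshold=0.5):
    """Fraction of test points whose thresholded prediction matches Y."""
    hits = sum((p > threshold) == bool(y) for p, y in zip(predicted, actual))
    return hits / len(actual)
```

For instance, predicted probabilities `[0.9, 0.2, 0.6, 0.4]` against observed states `[1, 0, 0, 1]` give an MSE of 0.1925 and a thresholded accuracy of 0.5.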

4.2 El Nino Dataset 

We now apply our algorithm to the El Nino dataset from the UCI Machine
Learning Repository [27]. The data set consists of oceanographic and
meteorological readings taken by buoys in the Pacific Ocean. This large dataset
consists of variables such as latitude, longitude, date, zonal winds and
humidity (for a complete list of variables see [27]). In this example, we try to
answer the question of dependence of variables that was posed in [27]: How
do the variables relate to each other?

Just as in the Adult dataset example, we remove data points with missing
values and partition the states into discrete values (see Table 2). After
data clean up, the dataset has 93935 data points that are used to learn the
Bayesian network. We again partition the entire dataset into training (with
92935 entries) and testing (with 1000 entries) parts. The resulting Bayesian
network is highly interconnected, as seen in Fig. 4. In particular, we find
dependencies between air temperature and humidity, air temperature and sea
surface temperature, and zonal winds and air temperature. As one would
expect, we find dependencies between seasons and sea surface temperature,
and between humidity and sea surface temperature. Note that though the
predicted dependencies between seasons and longitude/latitude seem peculiar,
they are to be expected since the buoys were not anchored at fixed locations
and were free to drift around [27]. Previous analysis of this dataset considered only
correlations and failed to pick up links between zonal/meridional winds and
meteorological quantities. We, however, do find dependencies between the
winds and meteorological quantities, suggesting a nonlinear relationship
between random variables.
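
The gap between correlation and dependence can be seen in a toy example that is unrelated to the El Nino data: if Y is the square of a symmetric variable X, the Pearson correlation vanishes while the mutual information is strictly positive. The functions below are an illustrative sketch for discrete samples.

```python
from collections import Counter
from math import log, sqrt


def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two discrete samples."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


xs = [-1, 0, 1] * 10
ys = [x * x for x in xs]  # fully determined by x, yet uncorrelated with it
```

Here `pearson(xs, ys)` is zero although `ys` is a deterministic function of `xs`, so an analysis based on correlations alone would miss the link that a dependence-based score can detect.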



Random variable            Number of states
-------------------------------------------
Season                     4
Latitude                   2
Longitude                  2
Zonal Wind                 2
Meridional Wind            2
Humidity                   2
Air Temperature            2
Sea Surface Temperature    2


Table 2: Number of states for each random variable in the El Nino dataset. 
Continuous variables have been discretized. 

To quantitatively test the predictions of the Bayesian network in Fig. 4,
we concentrate on predicting zonal wind speeds using seasons and longitude.
Using Eqn. (4), we find that the MSE of the predictions is 0.09. If we again
threshold the predicted values of zonal wind at P > 0.5, we find that the
zonal wind is predicted with 89% accuracy.

Figure 4: Structure of the El Nino dataset Bayesian network learnt using
the history dependent TSP and greedy hill climbing.

5 Conclusions and Future Work

In this work, we have presented a new approach for learning the structure
and parameters of a Bayesian network. The method computes an ordering
of the random variables based on a history dependent TSP on the random
variables. This ordering, typically supplied by domain knowledge experts,
significantly reduces the search space for hill-climbing methods. This makes
the underlying optimization techniques effective at finding Bayesian network
structures that maximize likelihood.

For computing the solution of the TSP, we use the Lin-Kernighan
heuristic [16, 17] with history dependent cost. The LKH approach is shown to
extend naturally to this case. We used the TSP with greedy hill climbing
to compute Bayesian networks to analyze the publicly available Adult and
El Nino datasets [27]. We find that the approach successfully computes
Bayesian networks that accurately capture the underlying system
interdependencies. We check the results against common knowledge as well as
domain specific studies.

Future work includes the development of novel and fast heuristics for
the history dependent TSP. There is a significant lack of methods to deal
with this class of problems. Additionally, to provide scalability, we
are investigating the utility of decentralized clustering methods [34] to learn
Bayesian networks in distributed settings.